2024-02-27
## retrodesign() function

Suppose you read a study whose result does not meet the standard of statistical significance you wish it did. The temptation is to argue that the “non-significant” hypothesis test failed to achieve significance only because it wasn’t powerful enough.
Post Hoc Power Calculations Are Not Useful from University of Virginia’s Research Data Services cites these papers.
This last piece is from The American Statistician and is entitled “The abuse of power: The pervasive fallacy of power calculations for data analysis.” Much wisdom there.
Blog is Statistical Modeling, Causal Inference, and Social Science
Headline Finding: A sample of ~500 men from America and India shows a significant relationship between sexist views and the presence of facial hair.
Excerpt 1:
Since a linear relationship has been found between facial hair thickness and perceived masculinity . . . we explored the relationship between facial hair thickness and sexism. . . . Pearson’s correlation found no significant relationships between facial hair thickness and hostile or benevolent sexism, education, age, sexual orientation, or relationship status.
Excerpt 2:
We conducted pairwise comparisons between clean-shaven men and each facial hair style on hostile and benevolent sexism scores. . . . For the purpose of further analyses, participants were classified as either clean-shaven or having facial hair based on their self- reported facial hair style . . . There was a significant Facial Hair Status by Sexism Type interaction . . .
So their headline finding appeared only because, after their first analysis failed, they shook and shook the data until they found something statistically significant.
Gelman:
When doing a power calculation, how do people specify an effect size of interest? Two main approaches…
Both approaches lead to studies that are too small, or to misinterpretation of the findings after the study is completed.
You collected data and analyzed the results. Now you want to do an after-data-gathering (post hoc) power analysis.
A broader notion of design, though, can be useful before and after data are gathered.
Gelman and Carlin recommend design calculations to estimate the probability that a statistically significant estimate has the wrong sign (a Type S error) and the factor by which such an estimate exaggerates the true effect (a Type M error).
These ideas can (and should) have value both before data collection/analysis and afterwards (especially when an apparently strong and significant effect is found).
You perform a study that yields estimate d with standard error s. Think of d as an estimated mean difference, for example.
Inputs to the function:

- D, the hypothesized true effect size
- s, the standard error of the estimate
- alpha, the statistical significance threshold (default 0.05)
- df, the degrees of freedom (default assumption: infinite)
- n.sims, the number of simulations (default 10,000)

Output: the power, the Type S error rate, and the exaggeration ratio (Type M error).
```r
retrodesign <- function(D, s, alpha = .05, df = Inf, n.sims = 10000)
{
  z <- qt(1 - alpha/2, df)
  p.hi <- 1 - pt(z - D/s, df)
  p.lo <- pt(-z - D/s, df)
  power <- p.hi + p.lo
  typeS <- p.lo/power
  estimate <- D + s*rt(n.sims, df)
  significant <- abs(estimate) > s*z
  exaggeration <- mean(abs(estimate)[significant])/D
  return(list(power = power, typeS = typeS,
              exaggeration = exaggeration))
}
```

Suppose the true effect is 2.8 standard errors away from zero, in a study built to have 80% power for that effect with 95% confidence.
| power | typeS | exaggeration |
|---|---|---|
| 0.79956 | \(1.21 \times 10^{-6}\) | 1.13 |
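The power and Type S columns can be checked analytically. A quick sketch, mirroring the body of the function above with D = 2.8, s = 1, and normal (df = Inf) cutoffs:

```r
# Checking power and Type S analytically for D = 2.8, s = 1, alpha = 0.05,
# using normal cutoffs as in retrodesign() with df = Inf.
z <- qnorm(1 - 0.05/2)        # about 1.96
p.hi <- 1 - pnorm(z - 2.8)    # chance of a significant positive estimate
p.lo <- pnorm(-z - 2.8)       # chance of a significant negative estimate
power <- p.hi + p.lo
typeS <- p.lo / power
round(power, 5)               # 0.79956, matching the table
signif(typeS, 3)              # 1.21e-06, matching the table
```

(The exaggeration ratio requires the simulation step, so it varies slightly from run to run.)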
## retrodesign for Zero Effect

Kanazawa study of 2,972 respondents from the National Longitudinal Study of Adolescent Health
We need to postulate an effect size, which will not be 8 percentage points. Instead, Gelman and colleagues hypothesized a range of true effect sizes using the scientific literature.
There is a large literature on variation in the sex ratio of human births, and the effects that have been found have been on the order of 1 percentage point (for example, the probability of a girl birth shifting from 48.5 percent to 49.5 percent).
Variation attributable to factors such as race, parental age, birth order, maternal weight, partnership status and season of birth is estimated at from less than 0.3 percentage points to about 2 percentage points, with larger changes (as high as 3 percentage points) arising under economic conditions of poverty and famine.
(There are) reliable findings that male fetuses (and also male babies and adults) are more likely than females to die under adverse conditions.
So, Gelman and colleagues hypothesized three potential effect sizes (0.1, 0.3 and 1.0 percentage points) and under each effect size, considered what might happen in a study with sample size equal to Kanazawa’s study.
qnorm(p = 0.015/2, lower.tail = FALSE) = 2.432

Assuming the true difference is 0.1 means that the probability of girl births differs by 0.1 percentage points, comparing attractive with unattractive parents.
If the estimate is statistically significant, then:
| Assumption | Power | Type S | Exaggeration Ratio |
|---|---|---|---|
| D = 0.1 | 0.05 | 0.46 | 78 |
| D = 0.3 | 0.05 | 0.39 | 25 |
| D = 1.0 | 0.06 | 0.19 | 7.8 |
Under a true difference of 1.0 percentage point, a statistically significant estimate would still have a 19% chance of carrying the wrong sign, and on average would overestimate the true effect by a factor of about 7.8.
Their effect size is tiny and their measurement error is huge. My best analogy is that they are trying to use a bathroom scale to weigh a feather … and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.
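The power and Type S columns of the table can be reproduced directly. A sketch, where the standard error s = 8/2.432 (about 3.29) is reconstructed from the study’s reported 8-point estimate and two-sided p = 0.015, an assumption rather than a quantity given in the data:

```r
# Reproducing power and Type S for the table above, with alpha = 0.05
# and normal cutoffs. s is the standard error implied by an 8-point
# estimate with two-sided p = 0.015 (a reconstruction, not given data).
s <- 8 / qnorm(p = 0.015/2, lower.tail = FALSE)  # about 3.29
z <- qnorm(1 - 0.05/2)
power_typeS <- function(D) {
  p.hi <- 1 - pnorm(z - D/s)
  p.lo <- pnorm(-z - D/s)
  c(power = p.hi + p.lo, typeS = p.lo / (p.hi + p.lo))
}
round(power_typeS(0.1), 2)  # power 0.05, typeS 0.46
round(power_typeS(1.0), 2)  # power 0.06, typeS 0.19
```

The exaggeration ratios (78, 25, 7.8) come from the simulation step of retrodesign().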
In advance, and after the fact, think hard about what a plausible effect size might be. Then…
Based on some reasonable assumptions regarding main effects and interactions (specifically that the interactions are half the size of the main effects), you need 16 times the sample size to estimate an interaction that you need to estimate a main effect.
And this implies a major, major problem with the usual plan of designing a study with a focus on the main effect, maybe even preregistering, and then looking to see what shows up in the interactions.
Or, even worse, designing a study, not finding the anticipated main effect, and then using the interactions to bail you out. The problem is not just that this sort of analysis is “exploratory”; it’s that these data are a lot noisier than you realize, so what you think of as interesting exploratory findings could be just a bunch of noise.
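A sketch of the arithmetic behind the factor of 16, under the stated assumption that the interaction is half the size of the main effect. With \(n\) subjects split evenly into two groups, the main effect (a difference in means) has standard error

\[
\text{se}_{\text{main}} = \sqrt{\frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2}} = \frac{2\sigma}{\sqrt{n}},
\]

while the interaction (a difference of differences across four cells of size \(n/4\)) has standard error

\[
\text{se}_{\text{int}} = \sqrt{4 \cdot \frac{\sigma^2}{n/4}} = \frac{4\sigma}{\sqrt{n}} = 2\,\text{se}_{\text{main}}.
\]

An effect half as large, estimated with twice the standard error, yields a \(z\)-statistic one quarter as large; recovering the original \(z\) requires shrinking the standard error by a factor of 4, which means \(4^2 = 16\) times the sample size.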
## crimestat data

For each of 51 states (including the District of Columbia), we have the state’s ID number, postal abbreviation and full name, as well as values for crime (the violent crime rate), poverty, and single.
## crimestat data set

# A tibble: 51 × 6
sid state crime poverty single state.full
<dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 AL 427. 19.2 9.02 Alabama
2 2 AK 636. 11.4 7.63 Alaska
3 3 AZ 400. 18.2 8.31 Arizona
4 4 AR 480. 18.7 9.41 Arkansas
5 5 CA 396. 16.4 7.25 California
6 6 CO 309. 12.1 6.75 Colorado
7 7 CT 237. 10.8 8.04 Connecticut
8 8 DE 489. 13 6.52 Delaware
9 9 DC 1244. 18.4 8.41 District of Columbia
10 10 FL 540. 16.6 8.29 Florida
# ℹ 41 more rows
## crime with poverty and single

Our main goal will be to build a linear regression model to predict crime using centered versions of both poverty and single.
Note the sneaky trick with the outside parentheses…
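As a sketch of what that might look like (the centered names pov_c and single_c and the name mod1 are assumed from the output below, and the toy data here are just the first three rows): wrapping the assignment in outer parentheses both stores the model and prints it.

```r
# Hypothetical sketch of the centering and the fit. The outer
# parentheses around the assignment make R print mod1 as it is created.
crimestat <- data.frame(
  crime   = c(427, 636, 400),    # first three rows, for illustration only
  poverty = c(19.2, 11.4, 18.2),
  single  = c(9.02, 7.63, 8.31))
crimestat$pov_c    <- crimestat$poverty - mean(crimestat$poverty)
crimestat$single_c <- crimestat$single  - mean(crimestat$single)
(mod1 <- lm(crime ~ pov_c + single_c, data = crimestat))
```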
tidy(mod1, conf.int = TRUE) |>
select(term, estimate, std.error,
p.value, conf.low, conf.high) |>
gt() |> fmt_number(decimals = 3) |> tab_options(table.font.size = 24)

| term | estimate | std.error | p.value | conf.low | conf.high |
|---|---|---|---|---|---|
| (Intercept) | 364.406 | 22.933 | 0.000 | 318.297 | 410.515 |
| pov_c | 16.115 | 9.616 | 0.100 | −3.219 | 35.448 |
| single_c | 23.843 | 18.384 | 0.201 | −13.121 | 60.807 |
Several points stand out in the residual plots.
There are several ways to do robust linear regression using M-estimation, including weighting using Huber and bisquare strategies.
We’ll fit the model, using the default weighting choice: what are called Huber weights, where observations with small residuals get a weight of 1, and the larger the residual, the smaller the weight.
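That weighting rule can be sketched directly; k = 1.345 is the usual default tuning constant for the Huber psi in MASS:

```r
# Huber weights: 1 for standardized residuals within k of zero,
# then k/|residual| beyond that, so larger residuals count for less.
huber_weight <- function(resid, k = 1.345) {
  ifelse(abs(resid) <= k, 1, k / abs(resid))
}
round(huber_weight(c(0.5, 1.345, 2.69)), 2)  # 1.00 1.00 0.50
```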
## Robust fit (with MASS::rlm)

| term | estimate | std.error | statistic |
|---|---|---|---|
| (Intercept) | 343.798 | 13.131 | 26.182 |
| pov_c | 11.910 | 5.506 | 2.163 |
| single_c | 30.987 | 10.527 | 2.944 |
Now, both predictors appear to have estimates that exceed twice their standard error. So this is a very different result than ordinary least squares gave us.
Summary of the OLS fit:

| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.197 | 0.163 | 163.771 | 5.883 | 0.005 | 2.000 | −330.842 | 669.684 | 677.411 | 1,287,404.929 | 48.000 | 51.000 |
Summary of the robust (Huber) fit:

| sigma | converged | logLik | AIC | BIC | deviance | nobs |
|---|---|---|---|---|---|---|
| 59.145 | TRUE | −331.378 | 670.757 | 678.484 | 1,314,783.620 | 51.000 |
Let’s augment the data with results from this model, including the weights.
crime_with_huber <- augment(rob.huber, crimestat) |>
mutate(w = rob.huber$w) |> arrange(w)
crime_with_huber |>
select(sid, state, w, crime, pov_c, single_c, everything()) |>
head()

# A tibble: 6 × 15
sid state w crime pov_c single_c poverty single state.full .fitted
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 9 DC 0.0951 1244. 3.53 0.721 18.4 8.41 District of … 408.
2 2 AK 0.237 636. -3.47 -0.0588 11.4 7.63 Alaska 301.
3 29 NV 0.278 636. 0.527 -0.0288 15.4 7.66 Nevada 349.
4 25 MS 0.303 278. 7.03 3.67 21.9 11.4 Mississippi 541.
5 43 TN 0.321 608. 3.33 -0.749 18.2 6.94 Tennessee 360.
6 46 VT 0.382 99.3 -2.67 -0.139 12.2 7.55 Vermont 308.
# ℹ 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
# .std.resid <dbl>
Call: rlm(formula = crime ~ pov_c + single_c, data = crimestat)
Residuals:
Min 1Q Median 3Q Max
-262.751 -45.641 1.762 36.732 836.244
Coefficients:
Value Std. Error t value
(Intercept) 343.7982 13.1309 26.1823
pov_c 11.9098 5.5058 2.1631
single_c 30.9868 10.5266 2.9437
Residual standard error: 59.14 on 48 degrees of freedom
As mentioned, there are several possible weighting functions. We’ll next try the biweight (also called the bisquare, or Tukey’s bisquare), in which all cases with a non-zero residual are down-weighted at least a little.
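The bisquare weight function can be sketched directly; c = 4.685 is the usual default tuning constant for psi.bisquare. (The model itself is fit with rlm(crime ~ pov_c + single_c, data = crimestat, psi = psi.bisquare), as the output below shows.)

```r
# Tukey's bisquare weights: (1 - (e/c)^2)^2 for |e| <= c, and 0 beyond,
# so every non-zero residual is down-weighted at least a little, and
# extreme residuals receive weight exactly zero.
bisquare_weight <- function(e, c = 4.685) {
  ifelse(abs(e) <= c, (1 - (e / c)^2)^2, 0)
}
round(bisquare_weight(c(0, 1, 4.685, 8)), 3)  # 1.000 0.911 0.000 0.000
```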
Here is the resulting fit…
Call:
rlm(formula = crime ~ pov_c + single_c, data = crimestat, psi = psi.bisquare)
Converged in 13 iterations
Coefficients:
(Intercept) pov_c single_c
336.17015 10.31578 34.70765
Degrees of freedom: 51 total; 48 residual
Scale estimate: 67.3
Let’s augment the data, as above.
crime_with_biweights <-
augment(rob.biweight, newdata = crimestat) |>
mutate(w = rob.biweight$w) |>
arrange(w)
head(crime_with_biweights, 3)

# A tibble: 3 × 11
sid state crime poverty single state.full pov_c single_c .fitted .resid
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 AK 636. 11.4 7.63 Alaska -3.47 -0.0588 298. 337.
2 9 DC 1244. 18.4 8.41 District of C… 3.53 0.721 398. 847.
3 29 NV 636. 15.4 7.66 Nevada 0.527 -0.0288 341. 295.
# ℹ 1 more variable: w <dbl>
Again, cases with large residuals (in absolute value) are down-weighted generally, but here, Alaska and Washington DC receive no weight at all in fitting the final model.
Call: rlm(formula = crime ~ pov_c + single_c, data = crimestat, psi = psi.bisquare)
Residuals:
Min 1Q Median 3Q Max
-257.58 -40.53 8.01 45.30 846.81
Coefficients:
Value Std. Error t value
(Intercept) 336.1702 12.6733 26.5259
pov_c 10.3158 5.3139 1.9413
single_c 34.7077 10.1598 3.4162
Residual standard error: 67.27 on 48 degrees of freedom
Fit summaries for comparison: first the OLS model (split across two tibbles), then the bisquare fit (sigma 67.3) and the Huber fit (sigma 59.1).

# A tibble: 1 × 6
r.squared adj.r.squared sigma statistic p.value df
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.197 0.163 164. 5.88 0.00518 2
# A tibble: 1 × 6
logLik AIC BIC deviance df.residual nobs
<dbl> <dbl> <dbl> <dbl> <int> <int>
1 -331. 670. 677. 1287405. 48 51
# A tibble: 1 × 7
sigma converged logLik AIC BIC deviance nobs
<dbl> <lgl> <logLik> <dbl> <dbl> <dbl> <int>
1 67.3 TRUE -331.8601 672. 679. 1339850. 51
# A tibble: 1 × 7
sigma converged logLik AIC BIC deviance nobs
<dbl> <lgl> <logLik> <dbl> <dbl> <dbl> <int>
1 59.1 TRUE -331.3785 671. 678. 1314784. 51
We can use the rq function in the quantreg package to model the median of our outcome (violent crime rate) on the basis of our predictors, rather than the mean, as is the case in ordinary least squares.
In fact, if we like, we can estimate any quantile by specifying the tau parameter (here tau = 0.5 by default, so we estimate the median).
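The idea can be sketched without the package: quantile regression minimizes the check loss \(\rho_\tau(u) = u(\tau - \mathbf{1}[u < 0])\), which at \(\tau = 0.5\) is proportional to absolute error, so an intercept-only median regression just returns the sample median. (The crime values below are a handful from the data shown earlier; the actual fit would be rq(crime ~ pov_c + single_c, data = crimestat, tau = 0.5).)

```r
# Check loss underlying quantile regression: rho_tau(u) = u * (tau - (u < 0)).
# At tau = 0.5 this equals |u|/2, so the minimizing constant is the median.
rho <- function(u, tau) u * (tau - (u < 0))
crime <- c(99.3, 237, 309, 396, 427, 636, 1244)  # a few crime values
loss <- sapply(crime, function(q) sum(rho(crime - q, tau = 0.5)))
crime[which.min(loss)]  # 396, the sample median of these values
```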
Estimating the Mean
| Fit | Intercept CI | pov_c CI | single_c CI |
|---|---|---|---|
| OLS | (318.6, 410.2) | (-3.13, 35.35) | (-12.92, 60.6) |
| Robust (Huber) | (320, 367.6) | (0.89, 22.93) | (9.93, 52.05) |
| Robust (biweight) | (310.7, 361.5) | (-0.3, 20.94) | (14.39, 55.03) |
Note: CIs estimated for OLS and Robust methods as point estimate \(\pm\) 2 standard errors
Estimating the Median
| Fit | Intercept CI | pov_c CI | single_c CI |
|---|---|---|---|
| Quantile (Median) Reg | (336.9, 366.2) | (3.07, 28.96) | (4.46, 48.19) |
| Fit | AIC | BIC |
|---|---|---|
| OLS | 669.7 | 677.4 |
| Robust (Huber) | 670.8 | 678.5 |
| Robust (biweight) | 671.7 | 679.4 |
| Quantile (median) | 637.5 | 643.3 |
432 Class 13 | 2024-02-27 | https://thomaselove.github.io/432-2024/